Abstract: Routing algorithms in the IP Internet provide a single path
between each source-destination pair, and where more than one path is
provided, the paths are of equal length. Single-path routing is inherently
slow in responding to congestion and temporary traffic bursts; multiple
paths are better suited to handle congestion. In addition, the paths
provided by RIP and OSPF are not free of loops during network transitions,
which can be debilitating to network performance. We present a distributed
routing algorithm that computes multiple paths, not necessarily of equal
length, between each source-destination pair in a computer network such
that the paths are loop-free at every instant, in steady state as well as
during network transitions. The algorithm scales to large networks because
it uses only one-hop synchronization, unlike diffusing computations, which
require internodal synchronization spanning multiple hops. The safety and
liveness properties of the algorithm are proven and its complexity is
analyzed.

I.  Introduction 

The most popular routing protocols used in today’s internets are based on
the exchange of vectors of distances (e.g., RIP [7] and EIGRP [2]) or
topology maps (e.g., OSPF [11]). RIP and many other routing protocols
based on the distributed Bellman-Ford algorithm (DBF) for shortest-path
computation suffer from the bouncing effect and the counting-to-infinity
problem, which limit their applicability to small networks using hop
count as the measure of distance. OSPF and algorithms based on topology
broadcast (e.g., [15], [12]) incur too much communication overhead, which
forces network administrators to partition the network into areas
connected by a backbone. This makes OSPF complex in terms of the router
configuration required. EIGRP uses a loop-free routing algorithm called
DUAL [3], which is based on internodal coordination that can span
multiple hops.

In  addition  to  DUAL,  several  algorithms  based  on  distance  vectors 
have  been  proposed  to  overcome  the  counting-to-infinity  problem  of 
DBF  [14],  [10],  [9],  [17].  All  of  these  algorithms  rely  on  exchanging 
queries  and  replies  along  multiple  hops,  a  technique  that  is  sometimes 
called  diffusing  computations,  because  it  has  its  origin  in  Dijkstra  and 
Scholten’s  basic  algorithm  [1].

A couple of routing algorithms have been proposed that operate using
partial topology information [4], [6] to eliminate the main limitation of
topology-broadcast algorithms. Furthermore, several distributed
shortest-path algorithms [8], [13], [5] have been proposed that use the
distance and second-to-last hop to destinations as the routing
information exchanged among nodes. These algorithms are often called
path-finding algorithms or source-tracing algorithms. All these
algorithms eliminate DBF’s counting-to-infinity problem, and some of them
[5] are more efficient than any of the routing algorithms based on
link-state information proposed to date. Furthermore, LPA [5] is
loop-free at every instant.

This work was supported in part by the Defense Advanced Research Projects
Agency (DARPA) under grants F30602-97-1-0291 and F19628-96-C-0038.


With the exception of DASM [17], all of the above routing algorithms
focus on the provision of a single path to each destination. A drawback
of DASM, however, is that it uses multi-hop synchronization, which limits
its scalability. Recently, we presented MPDA [16], which is the first
routing algorithm based on link states that provides multiple loop-free
paths using one-hop synchronization. In this paper, we present a variant
of MPDA called MPATH, which is the first routing algorithm based on
distance vectors that (a) provides multiple paths of unequal cost to each
destination that are free of loops at every instant, in steady state as
well as during network transitions, and (b) uses a synchronization
mechanism that spans only one hop, which makes it more scalable than
routing algorithms based on diffusing computations spanning multiple
hops. MPATH is a path-finding algorithm, and differs from prior similar
algorithms in the invariants used to ensure multiple loop-free paths of
unequal cost. The differences between MPATH and MPDA are a result of the
differences in the kind of information that nodes exchange.

Section II describes MPATH. Section III presents the correctness proofs
showing that MPATH is loop-free at every instant, safe, and live. Section
IV analyzes the complexity of MPATH. Section V provides concluding
remarks.

II.  Distributed  Multipath  Routing  Algorithm 
A.  Problem  Formulation 

A computer network is represented as a graph $G = (N, L)$, where $N$ is
the set of nodes (routers) and $L$ is the set of edges (links) connecting
the nodes. A cost is associated with each link and can change over time,
but is always positive. Two nodes connected by a link are called adjacent
nodes or neighbors. The set of all neighbors of a given node $i$ is
denoted by $N^i$. Adjacent nodes communicate with each other using
messages, and messages transmitted over an operational link are received
with no errors, in the proper sequence, and within a finite time.
Furthermore, such messages are processed by the receiving node one at a
time in the order received. A node detects the failure, recovery, and
cost changes of each adjacent link within a finite time.

The goal of our distributed routing algorithm is to determine at each
node $i$ the successor set of $i$ for destination $j$, which we denote by
$S^i_j(t) \subseteq N^i$, such that the routing graph $SG_j(t)$
consisting of the link set $\{(m, n) \mid n \in S^m_j(t),\ m \in N\}$ is
free of loops at every instant $t$, even when link costs are changing
with time. The routing graph $SG_j(t)$ for single-path routing is a sink
tree rooted at $j$, because the successor sets $S^i_j(t)$ have at most
one member. In multipath routing, there can be more than one member in
$S^i_j(t)$; therefore, $SG_j(t)$ is a directed acyclic graph with $j$ as
the sink node. There are potentially several $SG_j(t)$ for each
destination $j$; however, the routing graph we are interested in is
defined by the successor sets
$S^i_j(t) = \{k \mid D^k_j(t) < D^i_j(t),\ k \in N^i\}$, where $D^i_j$ is
the shortest distance of node $i$ to destination $j$. We call such a
routing graph the shortest multipath for destination $j$.

Procedure INIT-PATH
{Invoked when the node comes up.}
1. Initialize all tables.
2. Run PATH algorithm.
End INIT-PATH

Algorithm PATH
{Invoked when a message M is received from neighbor k, or an adjacent
 link to k has changed, or when a node is initialized.}
1. Run NTU to update neighbor tables.
2. Run MTU to update main tables.
3. For each destination j marked as changed,
      Add update entry [j, $D^i_j$, $p^i_j$] to the new message M'.
4. Within a finite amount of time, send message M' to each neighbor.
End PATH

Fig. 1. The PATH Algorithm

After a series of link cost changes that leave the network topology in an
arbitrary configuration, the distributed routing algorithm should work to
modify $SG_j$ in such a way that it eventually converges to the shortest
multipath of the new configuration, without ever creating a loop in
$SG_j$ during the process.
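
As an illustration of the shortest-multipath definition, the following
Python sketch computes the successor sets from a global view of a small
network. The four-node topology, the node names, and the helper names
`dijkstra` and `shortest_multipath` are assumptions made for this example
only; a real node has no global view and must rely on the distributed
computation described in the rest of this section. Link costs are assumed
symmetric so that all distances to $j$ can be found by one search rooted
at $j$.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> {neighbor: cost}."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in graph[u].items():
            if d + cost < dist[v]:
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist


def shortest_multipath(graph, dest):
    """Successor sets S[i] = {k in N^i : D^k_j < D^i_j} for j = dest."""
    dist = dijkstra(graph, dest)  # valid because costs are symmetric here
    return {
        i: {k for k in graph[i] if dist[k] < dist[i]}
        for i in graph
        if i != dest
    }


# Hypothetical network with symmetric, positive link costs.
graph = {
    "a": {"b": 1, "c": 2},
    "b": {"a": 1, "c": 1, "j": 3},
    "c": {"a": 2, "b": 1, "j": 1},
    "j": {"b": 3, "c": 1},
}
successors = shortest_multipath(graph, "j")
```

In this example node b keeps both c and j as successors even though the
direct link to j costs 3 and the path through c costs 2, which is the
sense in which the multiple paths need not have equal length.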

Because $D^k_j$ is node $k$’s local variable, its value has to be
explicitly or implicitly communicated to $i$. If $D^i_{jk}$ is the value
of $D^k_j$ as known to node $i$, the problem now becomes one of computing
$S^i_j(t) = \{k \mid D^i_{jk}(t) < D^i_j(t)\}$. However, because of
non-zero propagation delays, during network transitions there can be
discrepancies between the value of $D^k_j$ and its copy $D^i_{jk}$ at
$i$, which may cause loops to form in $SG_j$. To prevent loops,
therefore, additional constraints must be imposed when computing
$S^i_j$. We show later that if the successor set at each node $i$ for
each destination $j$ satisfies certain conditions, called loop-free
invariant conditions, then the snapshot at time $t$ of the routing graph
$SG_j(t)$ implied by $S^i_j(t)$ is free of loops. Our solution to this
problem consists of two parts: (1) computing $D^i_j$ using a
shortest-path routing algorithm called PATH and (2) extending it to
compute $S^i_j$ such that they satisfy the loop-free invariant conditions
at every instant.

B.  Node  Tables  and  Message  Structures 

As  in  DBF,  nodes  executing  MPATH  exchange  messages  containing
distances  to  destinations.  In  addition  to  the  distance  to  a  destination, 
nodes  also  exchange  the  identity  of  the  second-to-last  node,  also  called 
predecessor  node,  which  is  the  node  just  before  the  destination  node 
on  the  shortest  path.  In  this  respect  MPATH  is  akin  to  several  prior 
algorithms  [5],  [13],  [8],  but  differs  in  its  specification,  verification  and 
analysis  and,  more  importantly,  in  the  multipath  operation  described  in 
the  next  section. 

The  following  information  is  maintained  at  each  node  i: 

1. The Main Distance Table contains $D^i_j$ and $p^i_j$, where $D^i_j$ is
the distance of node $i$ to destination $j$ and $p^i_j$ is the
predecessor to destination $j$ on the shortest path from $i$ to $j$. The
table also stores, for each destination $j$, the successor set $S^i_j$,
the feasible distance $FD^i_j$, the reported distance $RD^i_j$, and two
flags, changed and report-it.

2. The Main Link Table $T^i$ is the node’s view of the network and
contains links represented by $(m, n, d)$, where $(m, n)$ is a link with
cost $d$.

3. The Neighbor Distance Table for neighbor $k$ contains $D^i_{jk}$ and
$p^i_{jk}$, where $D^i_{jk}$ is the distance of neighbor $k$ to $j$ as
communicated by $k$ and $p^i_{jk}$ is the predecessor to $j$ on the
shortest path from $k$ to $j$ as notified by $k$.

4. The Neighbor Link Table $T^i_k$ is neighbor $k$’s view of the network
as known to $i$ and contains link information derived from the distance
and predecessor information in the neighbor distance table.

5. The Adjacent Link Table stores the cost $l^i_k$ of the adjacent link
to each neighbor $k$. If a link is down, its cost is infinity.

Procedure NTU
{Called by PATH to process an event.}
1. If the event is a message M from neighbor k,
   a. For each entry [j, d, p] in M   {Note: d = $D^k_j$, p = $p^k_j$.}
         Set $D^i_{jk}$ ← d and $p^i_{jk}$ ← p.
   b. For each destination j with an entry in M,
         Remove existing links (n, j) in $T^i_k$ and add new link
         (m, j, d) to $T^i_k$, where d = $D^i_{jk} - D^i_{mk}$ and
         m = $p^i_{jk}$.
2. If the event is an adjacent link-status change, update $l^i_k$ and
   clear the neighbor tables of k if the link is down.
End NTU

Fig. 2. Neighbor Table Update Algorithm

Nodes exchange information using update messages, which have the
following format.

1. An update message can contain one or more update entries. An update
entry is a triplet $[j, d, p]$, where $d$ is the distance of the node
sending the message to destination $j$ and $p$ is the predecessor on the
path to $j$.

2. Each message carries two flags used for synchronization: query and
reply.
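
The message format can be made concrete with a minimal Python sketch. The
class and field names (`UpdateEntry`, `UpdateMessage`, `dest`, `distance`,
`predecessor`) are hypothetical and chosen only for illustration; the
paper does not prescribe an encoding. The `is_empty` helper reflects an
assumption that a message carrying only a flag still counts as worth
sending, since a reply with no entries is a legitimate message.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class UpdateEntry:
    """One triplet [j, d, p] of an update message."""
    dest: str                    # destination j
    distance: float              # sender's distance d to j
    predecessor: Optional[str]   # predecessor p on the sender's path to j


@dataclass
class UpdateMessage:
    """A list of update entries plus the two synchronization flags."""
    entries: List[UpdateEntry] = field(default_factory=list)
    query: bool = False
    reply: bool = False

    def is_empty(self) -> bool:
        # Assumption: a message with no entries but a flag set is not empty.
        return not self.entries and not self.query and not self.reply


# A reply that carries no distance changes, only the reply flag.
bare_reply = UpdateMessage(reply=True)

# A query reporting an increased distance of 4 to destination "j" via "c".
query = UpdateMessage(entries=[UpdateEntry("j", 4, "c")], query=True)
```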

C.  Computing  $D^i_j$

As mentioned earlier, our strategy is to first design a shortest-path
routing algorithm and then make the multipath extensions to it. This
subsection describes our shortest-path algorithm PATH and the next
subsection describes the multipath extensions. Figure 1 shows the
pseudocode of PATH. INIT-PATH is called at node startup to initialize the
tables; distances are initialized to infinity and node identities to a
null value. PATH is executed in response to an event that can be either
the receipt of an update message from a neighbor or the detection of an
adjacent link cost or link status (up/down) change. PATH invokes
procedure NTU, described in Figure 2, which first updates the neighbor
distance tables and then updates $T^i_k$ with links $(m, n, d)$, where
$d = D^i_{nk} - D^i_{mk}$ and $m = p^i_{nk}$. PATH then invokes procedure
MTU, specified in Figure 5, which constructs $T^i$ by merging the
topologies $T^i_k$ and the adjacent links $l^i_k$.
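
The link derivation performed by NTU can be sketched in Python as
follows. The function name `links_from_vector` and the dictionary layout
are assumptions for illustration, as is the convention that the neighbor
reports itself at distance 0 with no predecessor. The sketch rebuilds the
whole neighbor link table from the stored vector, whereas NTU updates it
incrementally for the destinations named in a message.

```python
def links_from_vector(vector):
    """Derive a neighbor link table from distances and predecessors.

    vector maps destination j -> (distance, predecessor) as reported by
    one neighbor k. Returns {(m, j): d} with m the predecessor of j and
    d = distance(j) - distance(m).
    """
    links = {}
    for j, (dist_j, pred) in vector.items():
        if pred is None or pred not in vector:
            continue  # the neighbor itself, or a predecessor not yet known
        links[(pred, j)] = dist_j - vector[pred][0]
    return links


# Hypothetical report from neighbor "k": x is reached directly at cost 2,
# and y is reached through x at total cost 5.
reported = {"k": (0, None), "x": (2, "k"), "y": (5, "x")}
neighbor_links = links_from_vector(reported)
```

The cost of link (x, y) is recovered as 5 - 2 = 3 without the neighbor
ever sending link-state information explicitly.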

The merging process is straightforward if all neighbor topologies
$T^i_k$ contain consistent link information, but when two or more
neighbor link tables contain conflicting information regarding a
particular link, the conflict must be resolved. Two neighbor tables are
said to contain conflicting information regarding a link if either both
report the link with different costs or one reports the link and the
other does not. Conflicts are resolved as follows: if two or more
neighbor link tables contain conflicting information about link
$(m, n)$, then $T^i$ is updated with the link information reported by the
neighbor $k$ that offers the shortest distance from node $i$ to the head
node $m$ of the link, i.e.,
$l^i_k + D^i_{mk} = \min\{l^i_x + D^i_{mx} \mid x \in N^i\}$. Ties are
broken in a consistent manner; one way is to always break ties in favor
of the neighbor with the lower address. Because $i$ itself is the head of
the link for adjacent links, any information about an adjacent link
supplied by neighbors is overridden by the most current information about
the link available to node $i$. Figure 4 shows the significance of the
tie-breaking rule.
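
The conflict-resolution rule can be illustrated with a short Python
sketch. The function `merge_neighbor_tables`, its argument layout, and
the two-neighbor data set are hypothetical; only the rule itself (pick
the neighbor minimizing $l^i_k + D^i_{mk}$ for each head node $m$, break
ties by lower address, then let adjacent links override) comes from the
text.

```python
INF = float("inf")


def merge_neighbor_tables(self_id, adjacent_cost, neighbor_dist, neighbor_links):
    """Merge neighbor link tables into one link table {(m, n): d}.

    adjacent_cost:  {k: cost of the adjacent link to neighbor k}
    neighbor_dist:  {k: {node: distance of k to node, as reported by k}}
    neighbor_links: {k: {(m, n): d}} link table derived for neighbor k
    """
    heads = {m for links in neighbor_links.values() for (m, _n) in links}
    merged = {}
    for m in sorted(heads):
        if m == self_id:
            continue  # adjacent links are taken from local knowledge below
        # (cost, address) tuples: min() prefers lower cost, then lower address.
        cost, preferred = min(
            (adjacent_cost[k] + neighbor_dist[k].get(m, INF), k)
            for k in adjacent_cost
        )
        if cost == INF:
            continue  # no neighbor currently reaches m
        for (head, tail), d in neighbor_links.get(preferred, {}).items():
            if head == m:
                merged[(head, tail)] = d
    for k, cost in adjacent_cost.items():
        merged[(self_id, k)] = cost  # local information overrides neighbors
    return merged


# Neighbors p and q disagree on the cost of link (x, y).
neighbor_dist = {"p": {"p": 0, "x": 1, "y": 3}, "q": {"q": 0, "x": 1, "y": 8}}
neighbor_links = {
    "p": {("p", "x"): 1, ("x", "y"): 2},
    "q": {("q", "x"): 1, ("x", "y"): 7},
}
merged = merge_neighbor_tables("i", {"p": 1, "q": 5}, neighbor_dist, neighbor_links)
tied = merge_neighbor_tables("i", {"p": 1, "q": 1}, neighbor_dist, neighbor_links)
```

In the first call p is preferred for head node x because 1 + 1 < 5 + 1,
so the cost reported by p is kept. In the second call both neighbors
offer distance 2 to x and the lower address p wins.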


Fig. 3. Example illustrating the main table update procedure. (a) The
adjacent links and neighbor tables of node $i$. (b) The main link table
of node $i$ after merging the neighbor tables, together with a table
showing the preferred neighbors.


Fig. 4. Significance of the tie-breaking rule. (a) An example network
with unit link costs. (b) Node $i$ has the costs of its adjacent links
and the shortest-path trees of its neighbors $p$ and $q$. The distances
of nodes $x$ and $y$ from $i$ are identical through both neighbors $p$
and $q$. (c) If MTU breaks ties in an arbitrary manner while constructing
$T^i$, it may choose $p$ as the preferred neighbor for node $x$ and $q$
as the preferred neighbor for node $y$, resulting in a graph that has no
path from $i$ to $j$. Ties, therefore, cannot be broken in an arbitrary
manner.

After merging the topologies, MTU runs Dijkstra’s shortest-path algorithm
to find the shortest-path tree and deletes all links from $T^i$ that are
not in the tree. Because there can be more than one shortest-path tree,
ties are again broken in a consistent manner while running Dijkstra’s
algorithm. The distances $D^i_j$ and predecessors $p^i_j$ can then be
obtained from $T^i$. The tree is compared with the previous shortest-path
tree and only the differences are reported to the neighbors. If there are
no differences, no updates are reported. Eventually all tables converge
such that the $D^i_j$ give the shortest distances and all message
activity ceases. The proofs are given in Section III.
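
A minimal Python sketch of this last stage is given below, assuming the
merged link table is a dictionary `{(m, n): cost}` with positive costs.
The names `shortest_path_tree` and `changed_entries` are hypothetical.
Ties between equal-cost paths are broken in favor of the predecessor with
the lower identifier, which is one possible consistent rule and not
necessarily the one used by the authors. The sketch does not report
destinations that disappear from the tree.

```python
import heapq


def shortest_path_tree(links, root):
    """Return {node: (distance, predecessor)} for the tree rooted at root."""
    adjacency = {}
    for (m, n), cost in links.items():
        adjacency.setdefault(m, []).append((n, cost))
    best = {root: (0, None)}
    heap = [(0, root)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        for v, cost in adjacency.get(u, []):
            candidate = (d + cost, u)
            # Tuple order: shorter distance first, then lower predecessor id.
            if v not in best or candidate < best[v]:
                best[v] = candidate
                heapq.heappush(heap, (d + cost, v))
    return best


def changed_entries(old_tree, new_tree):
    """Entries whose distance or predecessor differs from the old tree."""
    return {j: e for j, e in new_tree.items() if old_tree.get(j) != e}


# Hypothetical merged link table at node i.
links = {
    ("i", "p"): 1,
    ("i", "q"): 5,
    ("p", "x"): 1,
    ("q", "x"): 1,
    ("x", "y"): 2,
}
tree = shortest_path_tree(links, "i")
previous = dict(tree, y=(6, "x"))  # pretend y was farther away before
updates = changed_entries(previous, tree)
```

Only the entry for y differs from the previous tree, so only that entry
would be placed in the next update message.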

D.  Computing  $S^i_j$

In this subsection, the final desired routing algorithm MPATH is derived
by making extensions to PATH. MPATH computes the successor sets $S^i_j$
by enforcing the loop-free invariant conditions described below and using
neighbor-to-neighbor synchronization.


Procedure MTU
1. Clear link table $T^i$.
2. For each node j ≠ i occurring in at least one $T^i_k$,
   a. Find MIN ← min{ $D^i_{jk} + l^i_k$ | k ∈ $N^i$ }.
   b. Let n be such that MIN = ($D^i_{jn} + l^i_n$). Ties are broken
      consistently. Neighbor n is the preferred neighbor for
      destination j. For each link (j, v, d) in $T^i_n$,
         Add link (j, v, d) to $T^i$.
3. Update $T^i$ with each link $l^i_k$.
4. Run Dijkstra’s shortest-path algorithm on $T^i$ to find new $D^i_j$
   and $p^i_j$.
5. For each destination j, if $D^i_j$ or $p^i_j$ changed from its
   previous value, set the changed and report-it flags for j.
End MTU

Fig. 5. Main Table Update Algorithm

Let $FD^i_j$, called the feasible distance, be an estimate of the
distance of node $i$ to node $j$, in the sense that $FD^i_j$ is equal to
$D^i_j$ when the network is in a stable state but, to prevent loops
during network transitions, is allowed to differ temporarily from
$D^i_j$.

Loop-free Invariant Conditions (LFI) [16]:

$FD^i_j(t) \le D^k_{ji}(t), \quad k \in N^i$    (1)

$S^i_j(t) = \{k \mid D^i_{jk}(t) < FD^i_j(t)\}$    (2)

The  invariant  conditions  (1)  and  (2)  state  that,  for  each  destination 
j,  a  node  i  can  choose  a  successor  whose  distance  to  j ,  as  known  to  i, 
is  less  than  the  distance  of  node  i  to  j  that  is  known  to  its  neighbors. 

Theorem 1: [16] If the LFI conditions are satisfied at any time $t$, the
$SG_j(t)$ implied by the successor sets $S^i_j(t)$ is loop-free.

Proof: Let $k \in S^i_j(t)$; then from (2) we have

$D^i_{jk}(t) < FD^i_j(t)$    (3)

At node $k$, because node $i$ is a neighbor, from (1) we have

$FD^k_j(t) \le D^i_{jk}(t)$    (4)

Combining (3) and (4) we get

$FD^k_j(t) < FD^i_j(t)$    (5)

Eq. (5) states that, if $k$ is a successor of node $i$ in a path to
destination $j$, then $k$’s feasible distance to $j$ is strictly less
than the feasible distance of node $i$ to $j$. Now, if the successor sets
define a loop at time $t$ with respect to $j$, then applying (5) along
the loop yields, for some node $p$ on the loop, the absurd relation
$FD^p_j(t) < FD^p_j(t)$. Therefore the LFI conditions are sufficient for
loop-freedom. ■
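
The role of the feasible distance can be seen in a small Python sketch.
The three helper functions and the two-node scenario are hypothetical.
Here `fd[i]` stands for $FD^i_j$ and `dist_via[i][k]` for $D^i_{jk}$, so
that `dist_via[k][i]` is the value $D^k_{ji}$ appearing in condition (1).
In the scenario, b has just lost its direct link to $j$ and its own
distance has risen from 1 to 3, but a still holds the old value 1.

```python
def successor_sets(fd, dist_via):
    """Condition (2): S^i_j = {k : D^i_{jk} < FD^i_j}."""
    return {i: {k for k, d in dist_via[i].items() if d < fd[i]} for i in fd}


def lfi_condition_1_holds(fd, dist_via):
    """Condition (1): FD^i_j <= D^k_{ji} for every neighbor k of i."""
    return all(fd[i] <= dist_via[k][i] for i in fd for k in dist_via[i])


def has_cycle(successors):
    """Depth-first search for a directed cycle in the successor graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GRAY
        for v in successors.get(u, ()):
            c = color.get(v, WHITE)
            if c == GRAY or (c == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in successors)


# a believes b is at distance 1 from j (stale); b believes a is at 2.
dist_via = {"a": {"b": 1}, "b": {"a": 2}}

# b keeps its old feasible distance 1 until a has learned the new value.
fd_held = {"a": 2, "b": 1}
# b naively uses its new distance 3 as the feasible distance.
fd_naive = {"a": 2, "b": 3}

safe = successor_sets(fd_held, dist_via)
unsafe = successor_sets(fd_naive, dist_via)
```

With the held feasible distance, condition (1) is satisfied and b has no
successor, so no loop can form. With the naive choice, condition (1) is
violated and a and b select each other.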

The invariants used in LFI are independent of whether the algorithm uses
link states or distance vectors; in link-state algorithms, such as MPDA,
the $D^i_{jk}$ are computed locally from the link states communicated by
the neighbors, while in distance-vector algorithms, like the MPATH
presented here, the $D^i_{jk}$ are directly communicated.

The invariants (1) and (2) suggest a technique for computing $S^i_j(t)$
such that the successor graph $SG_j(t)$ for destination $j$ is loop-free
at every instant. The key is determining $FD^i_j(t)$ in Eq. (1), which
requires node $i$ to know $D^k_{ji}(t)$, the distance from $i$ to node
$j$ in the topology table $T^k_i$ that node $i$ communicated to neighbor
$k$. Because of non-zero propagation delay, $T^k_i$ is a time-delayed
version of $T^i$. We observe that, if node $i$ delays updating $FD^i_j$
with $D^i_j$ until $k$ incorporates the distance $D^i_j$ in its tables,
then $FD^i_j$ satisfies the LFI condition.

Procedure INIT-MPATH
{Invoked when the node comes up.}
1. Initialize tables and run MPATH.
End INIT-MPATH

Algorithm MPATH
{Invoked when a message M is received from neighbor k, or an adjacent
 link to k has changed.}
1. Run NTU to update neighbor tables.
2. Run MTU to obtain new $D^i_j$ and $p^i_j$.
3. If node is PASSIVE, or node is ACTIVE and the last reply arrived,
      Reset goactive flag.
      For each destination j marked as report-it,
         a. $FD^i_j$ ← min{ $D^i_j$, $RD^i_j$ }
         b. If $D^i_j$ > $RD^i_j$, set goactive flag.
         c. $RD^i_j$ ← $D^i_j$
         d. Add [j, $RD^i_j$, $p^i_j$] to message M'.
         e. Clear report-it flag for j.
   Otherwise, the node is ACTIVE and waiting for more replies.
      For each destination j marked as changed,
         f. $FD^i_j$ ← min{ $D^i_j$, $FD^i_j$ }
4. For each destination j marked as changed,
      a. Clear changed flag for j.
      b. $S^i_j$ ← { k | $D^i_{jk}$ < $FD^i_j$ }
5. For each neighbor k,
      a. M'' ← M'.
      b. If the event is a query from k, set reply flag in M''.
      c. If goactive is set, set query flag in M''.
      d. If M'' is non-empty, send M'' to k.
6. If goactive is set, become ACTIVE; otherwise become PASSIVE.
End MPATH

Fig. 6. Multi-path Loop-free Routing Algorithm

Pseudocode for MPATH is shown in Figure 6. MPATH enforces the LFI
conditions by synchronizing the exchange of update messages among
neighbors using query and reply flags. If a node sends a message with the
query bit set, then the node must wait until a reply is received from all
its neighbors before it is allowed to send the next update message. The
node is said to be in the ACTIVE state during this period. The
inter-neighbor synchronization used in MPATH spans only one hop, unlike
algorithms that use diffusing computations, which potentially span the
whole network (e.g., DASM [17]).

Assume that all nodes are initially in the PASSIVE state with correct
distances to all other nodes and that no messages are in transit or
pending to be processed. The behavior of a network in which every node
runs MPATH is such that, when a finite sequence of link cost changes
occurs in the network within a finite time interval, some or all nodes go
through a series of PASSIVE-to-ACTIVE and ACTIVE-to-PASSIVE state
transitions, until eventually all nodes become PASSIVE with correct
distances to all destinations.

Let a node in the PASSIVE state receive an event resulting in changes in
its distances to some destinations. Before the node sends an update
message to report the new distances, it checks whether the distance
$D^i_j$ to any destination $j$ has increased above the previously
reported distance $RD^i_j$. If none of the distances increased, then the
node remains in the PASSIVE state. Otherwise, the node sets the query
flag in the update message, sends it, and goes into the ACTIVE state.
When in the ACTIVE state, a node cannot send any update messages or add
neighbors to any successor set. After receiving replies from all its
neighbors, the node is allowed to modify the successor sets and report
any changes that may have occurred since it transitioned to the ACTIVE
state, and if none of the distances increased beyond the reported
distance, the node transitions to the PASSIVE state. Otherwise, the node
sends the next update message with the query bit set and becomes ACTIVE
again, and the whole cycle repeats. If a node receives a message with the
query bit set when in the PASSIVE state, it modifies its tables and then
sends back an update message with the reply flag set. Otherwise, if the
node happens to be in the ACTIVE state, it modifies the tables but,
because the node is not allowed to send updates when in the ACTIVE state,
it sends back an empty message with no updates but with the reply bit
set. If a reply from a neighbor is pending when the link to the neighbor
fails, then an implicit reply is assumed, and such a reply is assumed to
report an infinite distance to the destination. Because replies are given
immediately to queries and replies are assumed to be given upon link
failure, deadlocks due to inter-neighbor synchronization cannot occur.
Eventually, all nodes become PASSIVE with correct distances to
destinations, which we prove in the next section.
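
The per-destination bookkeeping in steps 3, 4 and 6 of Fig. 6 can be
sketched in Python as follows. This is a simplified illustration, not the
full algorithm: it tracks a single destination, omits message
construction and the changed and report-it flags, and expects the caller
to supply the new distance that MTU would compute. The names `DestState`
and `mpath_step` are hypothetical.

```python
from dataclasses import dataclass, field

INF = float("inf")


@dataclass
class DestState:
    d: float = INF    # current distance D^i_j
    rd: float = INF   # last reported distance RD^i_j
    fd: float = INF   # feasible distance FD^i_j
    via: dict = field(default_factory=dict)       # k -> D^i_{jk}
    successors: set = field(default_factory=set)  # S^i_j


def mpath_step(state, active, last_reply_arrived):
    """Update FD, RD and S for one destination; return True to go ACTIVE."""
    go_active = False
    if not active or last_reply_arrived:
        state.fd = min(state.d, state.rd)   # step 3a
        go_active = state.d > state.rd      # step 3b
        state.rd = state.d                  # step 3c (distance is reported)
    else:
        state.fd = min(state.d, state.fd)   # step 3f (still waiting)
    state.successors = {k for k, dk in state.via.items() if dk < state.fd}
    return go_active


# PASSIVE node with unit-cost links to b and c; b is the only successor.
s = DestState(d=2, rd=2, fd=2, via={"b": 1, "c": 3}, successors={"b"})

# Neighbor b now reports distance 5, so the best path is via c: 3 + 1 = 4.
s.via["b"] = 5
s.d = 4
active = mpath_step(s, active=False, last_reply_arrived=False)
while_active = (s.fd, s.rd, set(s.successors))

# All replies have arrived and the distance has not changed further.
active = mpath_step(s, active=active, last_reply_arrived=True)
```

While the node is ACTIVE the feasible distance stays at 2, so c (at
distance 3) cannot yet be used. Only after every neighbor has replied
does the feasible distance rise to 4 and c enter the successor set.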

III.  Correctness  of  MPATH 

The following properties of MPATH must be proved: (1) MPATH eventually
converges with $D^i_j$ giving the shortest distances and (2) the
successor graph $SG_j$ is loop-free at every instant and eventually
converges to the shortest multipath. PATH works essentially like PDA [16]
except that the kind of update information exchanged is different; PDA
exchanges link states while PATH exchanges distance vectors with
predecessor information. The correctness proofs of PATH are identical to
those of PDA and are reproduced here for completeness. The convergence of
MPATH follows directly from the convergence of PATH because the
extensions that turn PATH into MPATH only delay update messages by a
finite amount of time.

Definitions: The n-hop minimum distance of node $i$ to node $j$ in a
network is the minimum distance possible using a path of $n$ hops (links)
or fewer. A path that offers the n-hop minimum distance is called an
n-hop minimum path. If there is no path with $n$ hops or fewer from node
$i$ to $j$, then the n-hop minimum distance from $i$ to $j$ is undefined.
An n-hop minimum tree of a node $i$ is a tree in which node $i$ is the
root and every path of $n$ hops or fewer from the root to any other node
is an n-hop minimum path.
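
The n-hop minimum distance can be computed with a hop-bounded relaxation
in the style of Bellman-Ford, as in the following Python sketch. The
function name `n_hop_min_distance`, the directed link dictionary, and the
three-node example are assumptions for illustration. Nodes for which the
n-hop minimum distance is undefined are simply absent from the result.

```python
def n_hop_min_distance(links, source, n):
    """Minimum distances from source using paths of at most n links.

    links maps (u, v) -> positive cost of the directed link u -> v.
    """
    dist = {source: 0}
    for _ in range(n):
        # Relax every link against the previous round only, so that each
        # round extends paths by at most one hop.
        nxt = dict(dist)
        for (u, v), cost in links.items():
            if u in dist and dist[u] + cost < nxt.get(v, float("inf")):
                nxt[v] = dist[u] + cost
        dist = nxt
    return dist


# Hypothetical network: the direct link i -> j is costlier than i -> a -> j.
links = {("i", "a"): 1, ("a", "j"): 1, ("i", "j"): 5}
zero_hop = n_hop_min_distance(links, "i", 0)
one_hop = n_hop_min_distance(links, "i", 1)
two_hop = n_hop_min_distance(links, "i", 2)
```

The 1-hop minimum distance from i to j is 5, while the 2-hop minimum
distance is 2, so the minimum can only improve as more hops are allowed.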

Let $\mathbf{G}$ denote the final topology of the network, as seen by an
omniscient observer, after all link changes have occurred. (We use bold
font to refer to quantities in $\mathbf{G}$.) Without loss of generality,
assume $\mathbf{G}$ is connected; if $\mathbf{G}$ is disconnected, the
proof applies to each connected component independently.

We say that a router $i$ knows at least the n-hop minimum tree if the
tree contained in its main link table $T^i$ is at least an n-hop minimum
tree rooted at $i$ in $\mathbf{G}$ and there are at least $n$ nodes in
$T^i$ that are reachable from the root $i$. Note that $T^i$ is such that
the links with head nodes that are more than $n$ hops away from $i$ may
have costs that do not agree with the link costs in $\mathbf{G}$.

Theorem 2: If node $i$ has adjacent link costs that agree with
$\mathbf{G}$ and, for each neighbor $k$, $T^i_k$ represents at least an
$(n-1)$-hop minimum tree, then after the execution of MTU, the
minimum-cost tree contained in $T^i$ is at least an n-hop minimum tree.

Proof:  The  proof  is  identical  to  the  proof  of  Lemma  1  in  [16]  and 
is  provided  in  the  appendix  for  convenient  reference.  ■ 

Theorem 3: A finite time after the last link cost change in the network,
the main topology $T^i$ at each node $i$ gives the correct shortest paths
to all known destinations.

Proof: The proof is identical to the proof of Theorem 2 in [16] and is
provided in the appendix for convenient reference. ■

A node generates update messages only to report changes in distances and
predecessors, so after convergence no messages will be generated. The
following theorems show that MPATH provides instantaneous loop-freedom
and correctly computes the shortest multipath.

Theorem 4: For the algorithm MPATH executed at node $i$, let $t_n$ be the
time when $RD^i_j$ is updated and reported for the n-th time. Then, the
following conditions always hold.

$FD^i_j(t_n) \le \min\{RD^i_j(t_{n-1}), RD^i_j(t_n)\}$    (6)

$FD^i_j(t) \le FD^i_j(t_n), \quad t \in [t_n, t_{n+1})$    (7)

Proof: From the working of MPATH in Fig. 6, we observe that $RD^i_j$ is
updated at line 3c when (a) the node goes from PASSIVE to ACTIVE because
of one or more distance increases, (b) the node receives the last reply
and goes from ACTIVE to PASSIVE, (c) the node is in the PASSIVE state and
remains in the PASSIVE state because the distance did not increase for
any destination, or (d) the node receives the last reply but immediately
goes into the ACTIVE state again. The reported distance $RD^i_j$ remains
unchanged during the ACTIVE phase. Because $FD^i_j$ is updated at line 3a
each time $RD^i_j$ is updated at line 3c, Eq. (6) follows. When the node
is in the ACTIVE phase, $FD^i_j$ may also be modified by the statement on
line 3f, which can only decrease it, and this implies Eq. (7). ■

Theorem 5: (Safety property) At any time $t$, the successor sets
$S^i_j(t)$ computed by MPATH are loop-free.

Proof: The proof is based on showing that the $FD^i_j$ and $S^i_j$
computed by MPATH satisfy the LFI conditions. Let $t_n$ be the time when
$RD^i_j$ is updated and reported for the n-th time. The proof is by
induction on the interval $[t_n, t_{n+1}]$. Let the LFI condition be true
up to time $t_n$; we show that

$FD^i_j(t) \le D^k_{ji}(t), \quad t \in [t_n, t_{n+1}]$    (8)

From Theorem 4 we have

$FD^i_j(t_n) \le \min\{RD^i_j(t_{n-1}), RD^i_j(t_n)\}$    (9)

$FD^i_j(t_{n+1}) \le \min\{RD^i_j(t_n), RD^i_j(t_{n+1})\}$    (10)

$FD^i_j(t) \le FD^i_j(t_n), \quad t \in [t_n, t_{n+1})$    (11)

Combining the above equations we get

$FD^i_j(t) \le \min\{RD^i_j(t_{n-1}), RD^i_j(t_n)\}, \quad t \in [t_n, t_{n+1}]$    (12)

Let $t'$ be the time when the message sent by $i$ at $t_n$ is received
and processed by neighbor $k$. Because of the non-zero propagation delay
across any link, $t'$ is such that $t_n < t' < t_{n+1}$, and because
$RD^i_j$ is modified at $t_n$ and remains unchanged in $(t_n, t_{n+1})$,
we get

$RD^i_j(t_{n-1}) \le D^k_{ji}(t), \quad t \in [t_n, t')$    (13)

$RD^i_j(t_n) \le D^k_{ji}(t), \quad t \in [t', t_{n+1}]$    (14)

From Eqs. (13) and (14) we get

$\min\{RD^i_j(t_{n-1}), RD^i_j(t_n)\} \le D^k_{ji}(t), \quad t \in [t_n, t_{n+1}]$    (15)

From (12) and (15) the inductive step (8) follows. Because
$FD^i_j(t_0) \le D^k_{ji}(t_0)$ at initialization, from induction we have
that $FD^i_j(t) \le D^k_{ji}(t)$ for all $t$. Given that the successor
sets are computed based on $FD^i_j$, it follows that the LFI conditions
are always satisfied. According to Theorem 1, this implies that the
successor graph $SG_j$ is always loop-free. ■

Theorem 6: (Liveness property) A finite time after the last change in the
network, the $D^i_j$ give the correct shortest distances and
$S^i_j = \{k \mid D^k_j < D^i_j,\ k \in N^i\}$.

Proof: The proof is similar to the proof of Theorem 4 in [16] and is
provided in the appendix for convenience. ■


IV.  Complexity  Analysis 

The main difference between PATH and MPATH is that the update messages
sent in MPATH are delayed by a finite amount of time in order to enforce
the invariants. As a result, the complexities of PATH and MPATH are
essentially the same and are therefore analyzed together.

The storage complexity is the amount of table space needed at a
node. Each one of the $|N^i|$ neighbor tables and the main distance
table has size of the order $O(|N|)$, and the main link table $T^i$ can
grow, during execution of MTU, to size at most $|N^i|$ times $O(|N|)$.
The storage complexity is therefore of the order $O(|N^i||N|)$.

The time complexity is the time it takes for the network to converge
after the last link-cost change in the network. To determine the time
complexity, we assume the computation time to be negligible compared to
the communication times. If $t_n$ is the time when every node has the
$n$-hop minimum tree, then because every node processes and reports
changes in finite time, $|t_{n+1} - t_n|$ is bounded. Let
$|t_{n+1} - t_n| \le \theta$ for some finite constant $\theta$. From
Theorem 3, the convergence time can be at most $|N|\theta$ and, hence,
the time complexity is $O(|N|)$.

The computation complexity is the time taken to build the node's
shortest-path tree in $T^i$ from the neighbor tables $T^i_k$. Updating
$T^i$ with the $T^i_k$ information is an $O(|N^i||N|)$ operation, and
running Dijkstra on $T^i$ takes $O(|N^i||N| \log(|N|))$. Therefore the
computation complexity is $O(|N^i||N| + |N^i||N| \log(|N|))$.
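
Because the logarithmic term dominates the linear one, the bound just
stated can be written more compactly. This is only an algebraic
simplification of the expression above, not an additional result:

```latex
O\bigl(|N^i||N| + |N^i||N|\log|N|\bigr) = O\bigl(|N^i||N|\log|N|\bigr)
```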

The communication complexity is the number of update messages
required for propagating a set of link-cost changes. The analysis for
multiple link-cost changes is complex because of the sensitivity to the
timing of the changes, so we provide the analysis only for the case of
a single link-cost change. A node removes a link from its shortest-path
tree only if a shorter path using two or more links is discovered, and
once discovered the path is remembered. Therefore, a removed link will
not be added again to the shortest path, which means that a link can be
included in and deleted from the shortest path by a node at most one
time. Because nodes report each change only once to each neighbor, an
update message can travel only once on a link, and therefore the number
of messages sent by a node can be at most $O(|E|)$. For certain
topologies and sensitively timed sequences of link-cost changes, the
amount of communication required by PATH can be exponential. Humblet [8]
provides an example that exhibits such behavior, and though PATH is
different from the shortest-path algorithm presented in that paper, we
note that PATH is not immune from such exponential behavior. However,
we believe such scenarios require sensitively timed link-cost changes,
which are very unlikely to occur in practice. If necessary, a small
hold-down time before sending update messages may be used to prevent
such behavior.
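
The hold-down idea mentioned above can be sketched as follows. The paper
does not specify a mechanism, so the class name, the explicit `now`
timestamps, and the `send` callback are all hypothetical; the sketch only
shows one way to coalesce changes that arrive within a hold-down window
into a single update message.

```python
class HoldDownSender:
    """Batch link-cost changes and release them after a hold-down period."""

    def __init__(self, hold_down, send):
        self.hold_down = hold_down  # length of the hold-down window
        self.send = send            # callback that transmits one update
        self.pending = {}           # link -> most recent cost
        self.deadline = None        # time at which the batch is released

    def report(self, now, link, cost):
        # A later change to the same link overwrites the earlier one,
        # so only the final cost within the window is ever transmitted.
        self.pending[link] = cost
        if self.deadline is None:
            self.deadline = now + self.hold_down

    def tick(self, now):
        # Release the batch once the hold-down period has elapsed.
        if self.deadline is not None and now >= self.deadline:
            self.send(dict(self.pending))
            self.pending.clear()
            self.deadline = None

sent = []
sender = HoldDownSender(hold_down=5, send=sent.append)
sender.report(0, ("a", "b"), 3)
sender.report(2, ("a", "b"), 1)   # replaces the cost reported at time 0
sender.tick(4)                    # still inside the window, nothing sent
sender.tick(5)                    # window elapsed, one update is sent
```

Two rapid changes to the same link thus produce one message rather than
two, at the price of delaying the report by at most the hold-down time.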

V.  Concluding  Remarks 

We have presented the first routing algorithm based on distance
information that provides multiple paths that need not have equal costs
and that are loop-free at every instant, without requiring inter-nodal
synchronization spanning more than one hop. The loop-free invariant
conditions presented here are quite general and can be used with
existing internet protocols. The multiple successors that MPATH makes
available at each node can be used for traffic load-balancing, which, as
we have shown using other algorithms (MPDA [16]), is necessary for
minimizing delays in a network. MPATH can therefore be used as an
alternative to MPDA to get similar performance. In future work we
intend to compare the performance of the three multipath routing
algorithms MPATH, MPDA and DASM [17] in terms of control message
overhead and convergence times and analyze their relative merits.

Appendix

Proof  of  Theorem  2: 

Let $H^i_n$ denote an $n$-hop minimum tree rooted at node $i$ in $G$ and
let $M^i_n$ be the set of nodes that are within $n$ hops from $i$ in
$H^i_n$. Let $D^{i,j}_n$ denote the distance of $i$ to $j$ in $H^i_n$.
Let $d_{ij}$ be the cost of the link $i \to j$. Node $i$ is called the
head of the link $i \to j$. The notation $i \rightsquigarrow j$ indicates
a path from $i$ to $j$ of zero or more links; if the path has zero
links, then $i = j$. The length of path $i \rightsquigarrow j$ is the sum
of the costs of all links in the path.

Property 1: From the principle of optimality (the sub-path of a
shortest path between two nodes is also the shortest path between the
end nodes of the sub-path), if $H$ and $H'$ are two $n$-hop minimum trees
rooted at node $i$ and $M$ and $M'$ are the sets of nodes that are within
$n$ hops from $i$ in $H$ and $H'$ respectively, then $M = M' = M^i_n$ and
$|M^i_n| \ge n$. For each $j \in M^i_n$ the length of path
$i \rightsquigarrow j$ in both $H$ and $H'$ is equal to $D^{i,j}_n$. For
$h > n$, $D^{i,j}_h \le D^{i,j}_n$.
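
The last statement of Property 1 can be checked numerically. The sketch
below assumes that the $n$-hop minimum distance is the least cost over
paths of at most $n$ links, and computes it with a hop-limited
Bellman-Ford relaxation. The example links and the function name are
hypothetical.

```python
INF = float("inf")

def n_hop_min_distances(links, src, n):
    """Least cost from src to each node over paths of at most n links.

    links is a list of directed (head, tail, cost) triples.
    """
    nodes = {src} | {u for u, _, _ in links} | {v for _, v, _ in links}
    dist = {x: INF for x in nodes}
    dist[src] = 0
    for _ in range(n):
        prev = dict(dist)  # relax from a snapshot so each round adds one hop
        for u, v, cost in links:
            if prev[u] + cost < dist[v]:
                dist[v] = prev[u] + cost
    return dist

# A costly direct link i -> j and a cheaper three-link detour.
links = [("i", "a", 1), ("a", "b", 1), ("b", "j", 1), ("i", "j", 10)]
d1 = n_hop_min_distances(links, "i", 1)
d2 = n_hop_min_distances(links, "i", 2)
d3 = n_hop_min_distances(links, "i", 3)
```

Here the distance to j stays at 10 for one and two hops and drops to 3
once three hops are allowed, so the distance never increases as the hop
limit grows.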

Let $A^i = \bigcup_{k \in N^i} A^i_k$, where $A^i_k$ is the set of nodes
in $T^i_k$. Because $T^i_k$ is at least an $(n-1)$-hop minimum tree and
node $i$ can appear at most once in each of the $A^i_k$, each $A^i_k$ has
at least $n-1$ unique elements. Therefore, $A^i$ has at least $n-1$
elements.

Let $M'_n$ be the set of the $n-1$ nearest elements to node $i$ in $A^i$.
That is, $M'_n \subseteq A^i$, $|M'_n| = n-1$, and for each $j \in M'_n$
and $v \in A^i - M'_n$,
$\min\{D^i_{jk} + l^i_k \mid k \in N^i\} \le \min\{D^i_{vk} + l^i_k \mid k \in N^i\}$.

To  prove  the  theorem  it  is  sufficient  to  prove  the  following: 

1. Let $G^i_n$ represent the graph constructed by MTU on lines 2 and 3
(i.e., before applying Dijkstra in line 4). For each $j \in M'_n$ there
is a path $i \rightsquigarrow j$ in $G^i_n$ such that its length is at
most $D^{i,j}_n$.

2. After running Dijkstra on $G^i_n$ on line 4 in MTU, the resulting tree
is at least an $n$-hop minimum tree.

Let us first assume part 1 is true and prove part 2, because it is
simple. From the statement in part 1, for each node $j \in M'_n$ there is
a path $i \rightsquigarrow j$ in $G^i_n$ with length at most $D^{i,j}_n$.
In the resulting tree after running Dijkstra, we can infer there is a
path $i \rightsquigarrow j$ with length at most $D^{i,j}_n$. Because
there are $n-1$ nodes in $M'_n$, the tree constructed has at least $n$
nodes including node $i$. From Property 1, it follows that the tree
constructed is at least an $n$-hop minimum tree.

We now prove part 1. Order the nodes in $M'_n$ in non-decreasing
order of their distance from $i$. The proof is by induction on the
sequence of elements in $M'_n$. The base case is true because for $m_1$,
the first element of $M'_n$, $l^i_{m_1} = \min\{l^i_k \mid k \in N^i\}$
and $l^i_{m_1} = D^{i,m_1}_n$. As induction hypothesis, let the statement
hold for the first $m-1$ elements of $M'_n$. Consider the $m$-th element
$j \in M'_n$. Let $K$ be the highest priority neighbor for which
$D^i_{jK} + l^i_K = \min\{D^i_{jk} + l^i_k \mid k \in N^i\}$. At most
$m-1$ nodes in $T^i_K$ can have lesser or equal distance than $j$, which
implies that a path $K \rightsquigarrow j$ exists with at most $m-1$
hops. Let $v$ be the neighbor of $j$ in $T^i_K$. Then the path
$K \rightsquigarrow v \to j$ has at most $m-1$ hops. Because $T^i_K$ is
at least an $(n-1)$-hop minimum tree, the link $v \to j$ must agree with
$G$. Since $D^i_{vK} + l^i_K \le D^i_{jK} + l^i_K$, from the induction
hypothesis there is a path $i \rightsquigarrow v$ in $G^i_n$ such that
its length is at most $D^{i,v}_n$.

Now we need to show that the preferred neighbor for $v$ is also $K$,
so that the link $v \to j$ will be included in the construction of
$G^i_n$, thus ensuring the existence of the path $i \rightsquigarrow j$
in $G^i_n$. If some neighbor $K'$ other than $K$ is the preferred
neighbor for $v$, then one of the following two conditions should hold:
(a) $D^i_{vK'} + l^i_{K'} < D^i_{vK} + l^i_K$, or
(b) $D^i_{vK'} + l^i_{K'} = D^i_{vK} + l^i_K$ and the priority of $K'$ is
greater than the priority of $K$.

Case (a): Because $D^i_{jK} + l^i_K \le D^i_{jK'} + l^i_{K'}$, it follows
that the length of the path $v \rightsquigarrow j$ in $T^i_{K'}$ is
greater than the cost of $v \to j$ in $G$, which implies that $T^i_{K'}$
is not an $(n-1)$-hop minimum tree, contradicting the assumption.
Therefore $D^i_{vK} + l^i_K = \min\{D^i_{vk} + l^i_k \mid k \in N^i\}$.

Case (b): Let $Q_j$ be the set of neighbors that give the minimum
distance for $j$, i.e., for each $k \in Q_j$,
$D^i_{jk} + l^i_k = \min\{D^i_{jk} + l^i_k \mid k \in N^i\}$. Similarly,
let $Q_v$ be such that for each $k \in Q_v$,
$D^i_{vk} + l^i_k = \min\{D^i_{vk} + l^i_k \mid k \in N^i\}$. If
$k \in Q_v$ and $k \notin Q_j$, then it follows from the same argument as
in case (a) that the length of $v \rightsquigarrow j$ in $T^i_k$ is
greater than the cost of $v \to j$ in $G$, implying that $T^i_k$ is not
an $(n-1)$-hop minimum tree, which contradicts the assumption. Therefore
$Q_v \subseteq Q_j$. Also, from the same argument it can be inferred that
$K \in Q_v$. Because $K$ has the highest priority among all members of
$Q_j$, $Q_v \subseteq Q_j$, and $K \in Q_v$, $K$ also has the highest
priority among all members of $Q_v$. This proves that $v \to j$ will be
included in the construction of $G^i_n$. Because
$D^{i,v}_n + d_{vj} = D^{i,j}_n$ in $G$, where $d_{vj}$ is the final cost
of link $v \to j$, and the length of $i \rightsquigarrow v$ in $G^i_n$ is
less than or equal to $D^{i,v}_n$ by the induction hypothesis, the length
of $i \rightsquigarrow j$ in $G^i_n$ is less than or equal to
$D^{i,j}_n$. This proves part 1 of the theorem.
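
The construction that this proof reasons about can be sketched in code.
The sketch follows only what the proof states about MTU: for every node a
preferred neighbor is chosen by least $D^i_{jk} + l^i_k$, the links headed
by that node are copied from the preferred neighbor's tree, the adjacent
links are added, and Dijkstra is run on the result. The tie-break by
smallest neighbor identifier, the `(head, tail, cost)` tuple layout, and
all function names are assumptions made for illustration.

```python
import heapq

INF = float("inf")

def shortest_path_tree(links, root):
    """Dijkstra over directed (head, tail, cost) links.

    Returns (dist, tree_links) with tree_links sorted for determinism.
    """
    adj = {}
    for head, tail, cost in links:
        adj.setdefault(head, []).append((tail, cost))
    dist, parent_link = {root: 0}, {}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in adj.get(u, []):
            if d + cost < dist.get(v, INF):
                dist[v] = d + cost
                parent_link[v] = (u, v, cost)
                heapq.heappush(heap, (d + cost, v))
    return dist, sorted(parent_link.values())

def mtu(i, adjacent_cost, neighbor_trees):
    """Merge the neighbor trees T^i_k into node i's own tree T^i."""
    # D^i_{jk}: distance from neighbor k to node j inside the tree k reported.
    dist_in = {k: shortest_path_tree(t, k)[0]
               for k, t in neighbor_trees.items()}
    nodes = {j for d in dist_in.values() for j in d if j != i}
    merged = []
    for j in nodes:
        # Preferred neighbor: least D^i_{jk} + l^i_k; ties go to the
        # smallest neighbor identifier (an assumed priority rule).
        best = min(neighbor_trees,
                   key=lambda k: (dist_in[k].get(j, INF) + adjacent_cost[k], k))
        # Copy only the links whose head is j from the preferred tree.
        merged += [link for link in neighbor_trees[best] if link[0] == j]
    # Add node i's own adjacent links, then keep only the tree links.
    merged += [(i, k, cost) for k, cost in adjacent_cost.items()]
    return shortest_path_tree(merged, i)

# Hypothetical example: node i has neighbors a and b.
trees = {
    "a": [("a", "i", 1), ("a", "c", 1), ("c", "d", 1), ("i", "b", 1)],
    "b": [("b", "i", 1), ("i", "a", 1), ("b", "c", 3), ("c", "d", 1)],
}
dist, tree = mtu("i", {"a": 1, "b": 1}, trees)
```

In this example the preferred neighbor for c and d is a, so the link from
c to d is taken from a's tree, and the merged tree reaches d at distance 3
through a and c.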

Proof  of  Theorem  3: 

The proof is by induction on $t_n$, the global time when for each node
$i$, $T^i$ is at least an $n$-hop minimum tree. Because the longest
loop-free path in the network has at most $N-1$ links, where $N$ is the
number of nodes in the network, $t_{N-1}$ is the time when every node has
the shortest path to every other node. We need to show that $t_{N-1}$ is
finite. The base case is $t_1$, the time when every node has the 1-hop
minimum distance, and because the adjacent link changes are notified
within finite time, $t_1 < \infty$. Let $t_n < \infty$ for some $n < N$.
Given that the propagation delays are finite, each node will have each of
its neighbors' $n$-hop minimum trees in finite time after $t_n$. From
Theorem 2 we can see that the node will have at least the $(n+1)$-hop
minimum tree in finite time after $t_n$. Therefore, $t_{n+1} < \infty$.
By induction we can see that $t_{N-1} < \infty$.

Proof  of  Theorem  6: 

The convergence of MPATH follows directly from the convergence
of PATH because the update messages in MPATH are only delayed
a finite time as allowed at line 4 in algorithm PATH. Therefore, the
distances $D^i_j$ in MPATH also converge to the shortest distances.
Because changes to $D^i_j$ are always reported to the neighbors and are
incorporated by the neighbors in their tables in finite time,
$D^i_{jk} = D^k_j$ for $k \in N^i$ after convergence. From line 3a in
MPATH, we observe that when node $i$ becomes passive, $FD^i_j = D^i_j$
holds true. Because all nodes are passive at convergence, it follows that
$S^i_j = \{k \mid D^i_{jk} < FD^i_j,\ k \in N^i\} = \{k \mid D^k_j < D^i_j,\ k \in N^i\}$.


